Exploration of Loan Data from Prosper by Thomas Ott

This report explores Loan Data from Prosper. The dataset was last updated on March 11, 2014.

According to the Prosper prospectus, “Prosper is a pioneer of online marketplace lending that connects borrowers and investors. Our goal is to enable borrowers to access credit at affordable rates and provide investors with attractive risk-adjusted rates of return.”

I will look at this dataset from the perspective of a potential investor.

Univariate Plots Section

## [1] 113937     81

The dataset contains 81 variables and 113,937 observations. Each observation is a loan.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Looking through the data types of each variable, the date fields were imported as type Factor. These fields need to be convert to the Date format for the analysis.

Though Prosper started offering loans in 2006, several of the fields that will be used in this analysis weren’t introduced until July of 2009. For this analysis, loans originated prior to that timeframe will be removed from the dataset. The fields are CreditGrade, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating..numeric, ProsperRating..Alpha, and ProsperScore.

## [1] 84837    81

With the date variables converted and the loans that originated before July 2009 removed, the new dataset contains 84,837 loans instead of 113,937.

##                    ListingKey    ListingNumber     ListingCreationDate 
##  17A93590655669644DB4C06:    6   Min.   : 416290   Min.   :2009-07-13  
##  349D3587495831350F0F648:    4   1st Qu.: 558146   1st Qu.:2012-02-12  
##  47C1359638497431975670B:    4   Median : 739352   Median :2013-03-28  
##  8474358854651984137201C:    4   Mean   : 773942   Mean   :2012-11-04  
##  DE8535960513435199406CE:    4   3rd Qu.: 979467   3rd Qu.:2013-10-26  
##  04C13599434217079754AEE:    3   Max.   :1255725   Max.   :2014-03-10  
##  (Other)                :84812                                         
##   CreditGrade         Term                       LoanStatus   
##         :84837   Min.   :12.00   Current              :56576  
##  A      :    0   1st Qu.:36.00   Completed            :19650  
##  AA     :    0   Median :36.00   Chargedoff           : 5335  
##  B      :    0   Mean   :42.49   Defaulted            : 1004  
##  C      :    0   3rd Qu.:60.00   Past Due (1-15 days) :  806  
##  D      :    0   Max.   :60.00   Past Due (31-60 days):  363  
##  (Other):    0                   (Other)              : 1103  
##    ClosedDate          BorrowerAPR       BorrowerRate     LenderYield    
##  Min.   :2009-08-27   Min.   :0.04583   Min.   :0.0400   Min.   :0.0300  
##  1st Qu.:2012-06-14   1st Qu.:0.16328   1st Qu.:0.1359   1st Qu.:0.1259  
##  Median :2013-02-22   Median :0.21945   Median :0.1875   Median :0.1775  
##  Mean   :2012-12-22   Mean   :0.22666   Mean   :0.1960   Mean   :0.1860  
##  3rd Qu.:2013-09-11   3rd Qu.:0.29254   3rd Qu.:0.2574   3rd Qu.:0.2474  
##  Max.   :2014-03-10   Max.   :0.42395   Max.   :0.3600   Max.   :0.3400  
##  NA's   :58848                                                           
##  EstimatedEffectiveYield EstimatedLoss     EstimatedReturn   
##  Min.   :-0.1827         Min.   :0.00490   Min.   :-0.18270  
##  1st Qu.: 0.1157         1st Qu.:0.04240   1st Qu.: 0.07408  
##  Median : 0.1615         Median :0.07240   Median : 0.09170  
##  Mean   : 0.1687         Mean   :0.08031   Mean   : 0.09607  
##  3rd Qu.: 0.2243         3rd Qu.:0.11200   3rd Qu.: 0.11660  
##  Max.   : 0.3199         Max.   :0.36600   Max.   : 0.28370  
##                                                              
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000           C      :18339         Min.   : 1.00  
##  1st Qu.:3.000           B      :15581         1st Qu.: 4.00  
##  Median :4.000           A      :14549         Median : 6.00  
##  Mean   :4.072           D      :14272         Mean   : 5.95  
##  3rd Qu.:5.000           E      : 9794         3rd Qu.: 8.00  
##  Max.   :7.000           HR     : 6934         Max.   :11.00  
##                          (Other): 5368                        
##  ListingCategory..numeric. BorrowerState                 Occupation   
##  Min.   : 0.000            CA     :10756   Other              :21314  
##  1st Qu.: 1.000            NY     : 5843   Professional       :10541  
##  Median : 1.000            TX     : 5633   Executive          : 3468  
##  Mean   : 3.313            FL     : 5406   Computer Programmer: 3234  
##  3rd Qu.: 3.000            IL     : 4264   Teacher            : 2887  
##  Max.   :20.000            OH     : 3372   Analyst            : 2734  
##                            (Other):49563   (Other)            :40659  
##       EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
##  Employed     :67310   Min.   :  0.0            False:39998        
##  Full-time    : 7912   1st Qu.: 30.0            True :44839        
##  Self-employed: 4538   Median : 74.0                               
##  Other        : 3806   Mean   :103.1                               
##  Not employed :  649   3rd Qu.:148.0                               
##  Retired      :  366   Max.   :755.0                               
##  (Other)      :  256   NA's   :19                                  
##  CurrentlyInGroup                    GroupKey     DateCreditPulled    
##  False:82595                             :82815   Min.   :2008-01-23  
##  True : 2242      3D4D3366260257624AB272D:  310   1st Qu.:2012-02-15  
##                   783C3371218786870A73D20:  208   Median :2013-03-29  
##                   52EA3425051368132B80C96:  150   Mean   :2012-11-05  
##                   B0473364376920128370B13:   83   3rd Qu.:2013-10-28  
##                   FEF83377364176536637E50:   82   Max.   :2014-03-10  
##                   (Other)                : 1189                       
##  CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
##  Min.   :600.0         Min.   :619.0         Min.   :1951-01-01     
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.:1990-05-25     
##  Median :700.0         Median :719.0         Median :1996-01-04     
##  Mean   :699.4         Mean   :718.4         Mean   :1994-12-21     
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.:2000-05-15     
##  Max.   :880.0         Max.   :899.0         Max.   :2012-12-22     
##                                                                     
##  CurrentCreditLines OpenCreditLines  TotalCreditLinespast7years
##  Min.   : 0.00      Min.   : 0.000   Min.   :  2.00            
##  1st Qu.: 7.00      1st Qu.: 6.000   1st Qu.: 18.00            
##  Median :10.00      Median : 9.000   Median : 26.00            
##  Mean   :10.51      Mean   : 9.528   Mean   : 27.66            
##  3rd Qu.:13.00      3rd Qu.:12.000   3rd Qu.: 35.00            
##  Max.   :59.00      Max.   :54.000   Max.   :125.00            
##                                                                
##  OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
##  Min.   : 0.000        Min.   :    0.0             Min.   : 0.0000     
##  1st Qu.: 4.000        1st Qu.:  156.0             1st Qu.: 0.0000     
##  Median : 7.000        Median :  311.0             Median : 0.0000     
##  Mean   : 7.386        Mean   :  430.7             Mean   : 0.9646     
##  3rd Qu.:10.000        3rd Qu.:  564.0             3rd Qu.: 1.0000     
##  Max.   :50.000        Max.   :13765.0             Max.   :27.0000     
##                                                                        
##  TotalInquiries  CurrentDelinquencies AmountDelinquent  
##  Min.   : 0.00   Min.   : 0.0000      Min.   :     0.0  
##  1st Qu.: 2.00   1st Qu.: 0.0000      1st Qu.:     0.0  
##  Median : 3.00   Median : 0.0000      Median :     0.0  
##  Mean   : 4.29   Mean   : 0.3223      Mean   :   950.9  
##  3rd Qu.: 6.00   3rd Qu.: 0.0000      3rd Qu.:     0.0  
##  Max.   :78.00   Max.   :51.0000      Max.   :463881.0  
##                                                         
##  DelinquenciesLast7Years PublicRecordsLast10Years
##  Min.   : 0.000          Min.   : 0.0000         
##  1st Qu.: 0.000          1st Qu.: 0.0000         
##  Median : 0.000          Median : 0.0000         
##  Mean   : 3.659          Mean   : 0.2852         
##  3rd Qu.: 2.000          3rd Qu.: 0.0000         
##  Max.   :99.000          Max.   :38.0000         
##                                                  
##  PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
##  Min.   : 0.000000         Min.   :     0         Min.   :0.0000     
##  1st Qu.: 0.000000         1st Qu.:  3824         1st Qu.:0.3300     
##  Median : 0.000000         Median :  9324         Median :0.6000     
##  Mean   : 0.009171         Mean   : 17939         Mean   :0.5642     
##  3rd Qu.: 0.000000         3rd Qu.: 20340         3rd Qu.:0.8300     
##  Max.   :20.000000         Max.   :999165         Max.   :2.5000     
##                                                                      
##  AvailableBankcardCredit  TotalTrades    
##  Min.   :     0          Min.   :  1.00  
##  1st Qu.:  1147          1st Qu.: 16.00  
##  Median :  4573          Median : 23.00  
##  Mean   : 11398          Mean   : 23.93  
##  3rd Qu.: 13909          3rd Qu.: 31.00  
##  Max.   :498374          Max.   :122.00  
##                                          
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.0800                     Min.   : 0.0000        
##  1st Qu.:0.8500                     1st Qu.: 0.0000        
##  Median :0.9500                     Median : 0.0000        
##  Mean   :0.9059                     Mean   : 0.7299        
##  3rd Qu.:1.0000                     3rd Qu.: 1.0000        
##  Max.   :1.0000                     Max.   :20.0000        
##                                                            
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $50,000-74,999:25621   False: 7332     
##  1st Qu.: 0.150    $25,000-49,999:24168   True :77505     
##  Median : 0.220    $100,000+     :15204                   
##  Mean   : 0.259    $75,000-99,999:14497                   
##  3rd Qu.: 0.320    $1-24,999     : 4653                   
##  Max.   :10.010    Not employed  :  649                   
##  NA's   :7295      (Other)       :   45                   
##  StatedMonthlyIncome                    LoanKey      TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:    6   Min.   :0.00     
##  1st Qu.:   3438     2DEE3698211017519D7333F:    4   1st Qu.:1.00     
##  Median :   5000     9F4B37043517554537C364C:    4   Median :1.00     
##  Mean   :   5932     D895370150591392337ED6D:    4   Mean   :1.46     
##  3rd Qu.:   7083     E6FB37073953690388BC56D:    4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:    3   Max.   :8.00     
##                      (Other)                :84812   NA's   :65051    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.0        
##  1st Qu.: 10.00             1st Qu.:  9.0        
##  Median : 18.00             Median : 17.0        
##  Mean   : 24.31             Mean   : 23.6        
##  3rd Qu.: 35.00             3rd Qu.: 34.0        
##  Max.   :141.00             Max.   :141.0        
##  NA's   :65051              NA's   :65051        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.66                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :65051                       NA's   :65051                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3800            1st Qu.:    0              
##  Median : 6500            Median : 1593              
##  Mean   : 8759            Mean   : 2920              
##  3rd Qu.:11800            3rd Qu.: 4122              
##  Max.   :72499            Max.   :23451              
##  NA's   :65051            NA's   :65051              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.0              Min.   :   0.00          
##  1st Qu.: -35.0              1st Qu.:   0.00          
##  Median :  -6.0              Median :   0.00          
##  Mean   :  -4.7              Mean   :  36.62          
##  3rd Qu.:  23.0              3rd Qu.:   0.00          
##  Max.   : 286.0              Max.   :1593.00          
##  NA's   :68206                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 1.00                 Min.   : 0.00              Min.   : 38063  
##  1st Qu.: 9.00                 1st Qu.: 4.00              1st Qu.: 60962  
##  Median :13.00                 Median :11.00              Median : 87797  
##  Mean   :14.47                 Mean   :16.02              Mean   : 86601  
##  3rd Qu.:19.00                 3rd Qu.:25.00              3rd Qu.:109100  
##  Max.   :41.00                 Max.   :55.00              Max.   :136486  
##  NA's   :78595                                                            
##  LoanOriginalAmount LoanOriginationDate  LoanOriginationQuarter
##  Min.   : 1000      Min.   :2009-08-03   Q4 2013:14450         
##  1st Qu.: 4000      1st Qu.:2012-02-23   Q1 2014:12172         
##  Median : 7500      Median :2013-04-09   Q3 2013: 9180         
##  Mean   : 9085      Mean   :2012-11-15   Q2 2013: 7099         
##  3rd Qu.:13500      3rd Qu.:2013-11-05   Q3 2012: 5632         
##  Max.   :35000      Max.   :2014-03-12   Q2 2012: 5061         
##                                          (Other):31243         
##                    MemberKey     MonthlyLoanPayment LP_CustomerPayments
##  720D3508651090808DC328F:    7   Min.   :   0.0     Min.   :   -2.35   
##  C70934206057523078260C7:    7   1st Qu.: 157.3     1st Qu.:  799.17   
##  E4AF3422677498955FFA00E:    7   Median : 252.0     Median : 2214.76   
##  18F6337949289842881D0A8:    6   Mean   : 292.0     Mean   : 3669.14   
##  3D6B34225353312993B9700:    6   3rd Qu.: 388.4     3rd Qu.: 4870.57   
##  43DB3366978035224D7D9E3:    6   Max.   :2251.5     Max.   :37369.16   
##  (Other)                :84798                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-589.95  
##  1st Qu.:  393.6              1st Qu.:  254.08   1st Qu.: -72.31  
##  Median : 1250.0              Median :  674.45   Median : -34.90  
##  Mean   : 2626.9              Mean   : 1042.26   Mean   : -54.65  
##  3rd Qu.: 3414.3              3rd Qu.: 1436.23   3rd Qu.: -14.28  
##  Max.   :35000.0              Max.   :10572.78   Max.   :   3.01  
##                                                                   
##  LP_CollectionFees   LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-4865.080   Min.   :  -94.2       Min.   : -504.4    
##  1st Qu.:    0.000   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.000   Median :    0.0       Median :    0.0    
##  Mean   :   -8.197   Mean   :  376.1       Mean   :  367.8    
##  3rd Qu.:    0.000   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.000   Max.   :25000.0       Max.   :25000.0    
##                                                               
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :   0.000                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:   0.000                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :   0.000                Median :1.0000   Median : 0.00000  
##  Mean   :   7.648                Mean   :0.9981   Mean   : 0.01756  
##  3rd Qu.:   0.000                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :7780.030                Max.   :1.0125   Max.   :19.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   :0.000000           Min.   :    0.000           Min.   :   1.00  
##  1st Qu.:0.000000           1st Qu.:    0.000           1st Qu.:   1.00  
##  Median :0.000000           Median :    0.000           Median :  32.00  
##  Mean   :0.008051           Mean   :    4.316           Mean   :  68.26  
##  3rd Qu.:0.000000           3rd Qu.:    0.000           3rd Qu.:  97.00  
##  Max.   :9.000000           Max.   :11000.000           Max.   :1189.00  
## 

The Summary of all the variables gives an initial idea of each variable’s range and central tendency.

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2009-08-03" "2012-02-23" "2013-04-09" "2012-11-15" "2013-11-05" 
##         Max. 
## "2014-03-12"

The LoanOriginationDate variable shows the date a loan was approved and disbursed to the borrower. The dates range from August 3, 2009 through March 12, 2014. You can see how the number of loans has steadily increased as the Prosper service has grown. There is a decrease in loans at the end of 2012 followed by a significant increase in 2013. Scanning through press releases on Prosper’s web site, a new management team backed by Sequoia Capital was put in place in January of 2013. The change in leadership was followed by the significant growth.

## 
##    12    36    60 
##  1613 58679 24545

Looking at the Term variable, all of the loans are either 12 months, 36 months, or 60 months. The most popular loan is 36 months, and the least popular is 12 months.

## # A tibble: 11 x 2
##                LoanStatus  proportion
##                    <fctr>       <dbl>
##  1             Chargedoff 0.062885298
##  2              Completed 0.231620637
##  3                Current 0.666878838
##  4              Defaulted 0.011834459
##  5 FinalPaymentInProgress 0.002416399
##  6   Past Due (>120 days) 0.000188597
##  7   Past Due (1-15 days) 0.009500572
##  8  Past Due (16-30 days) 0.003123637
##  9  Past Due (31-60 days) 0.004278793
## 10  Past Due (61-90 days) 0.003689428
## 11 Past Due (91-120 days) 0.003583342

The table after the chart shows the proportion of each loan status. The majority of loans are current, which at first seemed surprising to me. Looking back at the histogram of loan creation dates, we see that a lot of the loans have been created in the past couple of years, and from the term bar chart that most loans are 3 or 5 years long. With that in mind, it makes sense that a large portion of loans would still be active.

Prosper assigns a proprietary loss rating to each loan when the listing is created. The ratings are from AA (best) to HR (high risk) and indicate the level of risk associated with a listing. The ratings are based on a credit report pulled from a credit agency along with previous loans the borrower has with prosper (which is reflected in the variable ProsperScore).

Looking at a histogram of the ratings, the distribution is somewhat normal, favoring better ratings.

The ListingCategory variable shows the borrowers plans for the loan money. There are 21 categories. By far, the largest reason for a loan is debt consolidation.

## [1] 52

The largest number of loans were initiated in the state of California followed by New York, Texas and Florida. None were initiated in Iowa, Maine or North Dakota. Note that the District of Columbia was also included, which technically isn’t a state, plus a blank option that had zero selections, resulting in 52 possible values.

## [1] 68

There are 68 possible values that can be selected for the borrower’s occupation. The occupation of Other was most often selected followed by Professional, Executive, and Computer Programmer. One of the possible values is a blank, which probably means nothing was selected.

## [1] 9

Employment status of the borrower has nine possible values, seven of which were selected by borrowers. By far, Employed was most often selected. To me, Full-time, Part-time and Self-employed are just subsets of Employed. I’m not sure what the Other option could be outside of the available options, and it’s too generic to be meaningful.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    74.0   103.1   148.0   755.0      19

The EmploymentStatusDuration variable shows the length in months of the employment status. The max value is 755, which is 62.9 years! Though highly unusual, I suppose it could happen. The histogram starts high and drops as the amount of time employed increases. This shows that most applicants are relatively new to their current employer and are either new to the work force or recently changed jobs. The trend flattens out at around 20 through 25 years, possibly due to folks holding out for retirement. Transforming the graph using sqrt shows a pattern of some longevity before moving on to another job.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    41250    60000    71178    85000 21000035

The StatedMonthlyIncome field has been annualized to better match the IncomeRange variable. The mean and median of stated income roughly match the histogram of income ranges and visually looks similar if the 100,000+ category were further divided. The histogram for stated income has a very long tail on the right due to a number of applicants with incomes in the millions; the max value of annualized income is over $21M. These high-end incomes could be investors getting a feel for the service.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9085   13500   35000

Loan amounts range from $1,000 to $35,000 with the majority being below $10,000. The distribution of loan amounts is skewed to the right and has spikes every $5,000. It looks like borrowers tend to round to the nearest $5,000.

Next, I will look at how much investors tend to invest per loan.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7.30    63.49   161.29  3783.27  5000.00 35000.00

Each loan is funded by multiple investors. By dividing the loan amount by the number of investors, the median amount invested on a loan is $161 and the mean amount is $3,783. The mean is skewed by some large investments that are outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1359  0.1875  0.1960  0.2574  0.3600

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.18270  0.07408  0.09170  0.09607  0.11660  0.28370

Each listing gives the loan’s interest rate (BorrowerRate) along with the expected return based on estimated losses (EstimatedReturn). Both of these numbers are annual rates.

Interest rates range from 0.04 to 0.36 with an average of 0.196. The histogram is a bit right skewed with a spike at 0.31.

Estimated returns range from a loss of 0.1827 to a gain of 0.2837 with an average gain of 0.096. The histogram appears to have outliers on both sides of the graph with the bulk of the loans having a positive return.

As a potential investor, I’m also interested in actual returns. The Prosper dataset doesn’t include this variable, so I’ll need to create it.

Looking through the available fields, I see three ways to calculate an actual return. All three returned the same result with one anomaly. When a loan is paid off successfully, the principal payments should equal the loan amount. Sometimes this is not the case. I avoided this anomaly by not using the principal payments filed and calculated the Actual Return as follows:

Starting with the Customer Payments, subtract the Original Loan Amount and the Fees Paid by Investor. Divide the result by the Original Loan Amount.

Since we don’t know the ultimate fate of active loans, the dataset needs to be limited to closed loans. When loans are closed, their status will be Cancelled, Completed, Chargedoff or Defaulted.

## [1] 25989    84
## # A tibble: 3 x 2
##   LoanStatus     n
##       <fctr> <int>
## 1 Chargedoff  5335
## 2  Completed 19650
## 3  Defaulted  1004
## # A tibble: 3 x 2
##   LoanStatus proportion
##       <fctr>      <dbl>
## 1 Chargedoff 0.20527916
## 2  Completed 0.75608911
## 3  Defaulted 0.03863173

There are 25,989 closed loans to analyze. 19,650 were completed, 5,335 were charged-off, 1,004 defaulted, and none were cancelled. About a quarter of the closed loans weren’t completely paid off.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.009095  0.007337  0.097196 -0.014252  0.217154  1.423780

The distribution of actual returns is bi-modal with a smaller grouping just above -1, in the 50-100% loss range, and a larger grouping just above 0, in the 0-50% gain range. On the boxplot, the inter-quartile-range (IQR) is in positive territory, but the mean is in negative territory due to significant loses.

Interestingly, the maximum actual return is 142%, while the maximum estimated return projected by prosper was only 28%. One thing to note is that my calculation of actual return is over the life of the loan, while Prosper’s estimated return is based on the annual interest rate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    -Inf 0.03232 0.11005     NaN 0.19444     Inf       7

When I try to annualize the actual return, problems arise. For instance, if none of the loan was ever paid, the actual return is -100%. If that loan was defaulted over a 6 month timeframe, the annualized number becomes -200%. Also, if someone paid on time through the first year and then defaulted, the first year would have a positive return followed by a negative return the next year. There’s probably a way to accurately annualize these loans, but I’m not sure how that works with the available data. For now I’ll settle with the overall return without an annualized version that can be compared against the estimates provided by Prosper.

Univariate Analysis

What is the structure of your dataset?

The dataset is loaded into a data frame that contains 113,937 loans from Prosper. There are 81 variables related to each loan.

The LoanOriginationDate field has loans from November, 15, 2005 through March 3, 2014.

The fields CreditGrade, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating..numeric, ProsperRating..Alpha, and ProsperScore weren’t used until July of 2009. Since some of these fields will be used in the analysis, loans prior to their use were removed from the dataset, leaving 84,837 records to analyze.

The Term of each loan can be 12, 36 or 60 months.

The Status of each loan can be Cancelled, Chargeoff, Completed, Current, Defualted, FinalPaymentInProgress, and PastDue.

The ProsperRating assigned to each loan when it was first listed can be from AA (best) to HR (high risk). The possible values are AA, A, B, C, D, E, HR.

Each loan has a ListingCategory that the borrower selects to indicate how they plan to use the funds. The value provided was numeric. A new field was added with the corresponding value of the field as provided by Prosper. For example, ‘1’ is ‘Debt Consolidation’, and ‘2’ is ‘Home Improvement’.

Other variables viewed include State, Occupation, EmploymentStatus, EmploymentStatusDuration, IncomeRange, StatedMonthlyIncome, LoanOriginalAmount, and LenderYield.

What is/are the main feature(s) of interest in your dataset?

Actual payment detail is key to understand if a loan has a positive return for the investor. These variables start with ‘LP_’. The loan amount and term are also needed.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I hope to better understand which of the following variables will help predict positive returns on new loan listings.

  • ProsperRating, ListingCategory, State, Occupation, EmploymentStatus, EmploymentStatusDuration, IncomeRange, StatedMonthlyIncome, and LenderYield.

Did you create any new variables from existing variables in the dataset?

The Actual Return is calculated on each loan and is meaningful on loans that have closed. This field was calculated by taking the Customer Payments, subtracting the Original Loan Amount and the Fees Paid by Investor, and then dividing the result by the Original Loan Amount.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Chart distributions:

  • The number of loans originated by date had a curious dip at the end of 2012.
  • The distribution on actual returns is bi-modal, representing defaulted loans on the negative side and completed loans on the positive side. Both groupings are skewed to the right.

Data adjustments:

  • Dates were loaded as Factor variables and needed to be converted into a date format.
  • The fields CreditGrade, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating..numeric, ProsperRating..Alpha, and ProsperScore weren’t used until July of 2009. Loans missing these fields were removed for consistency.
  • The ProsperRating factors had to be re-ordered so they displayed in the correct order.
  • The ListingCategory variable is a numeric field. A new field was added with the corresponding value of the field as provided by Prosper. This was done so the chart would have meaningful labels.
  • The IncomeRange factors had to be re-ordered so they displayed in the correct order.
  • When looking at actual returns, only closed loans were included since active loans don’t have a final return yet.
  • A new variable called StatedMonthlyIncome..Annualized was created by multiplying StatedMonthlyIncome by twelve so stated income could be compared to the IncomeRange field that is also stated as an annual number.

Bivariate Plots Section

Looking at the box plots of actual returns on closed loans by loan status, there is a clear separation of loans that completed successfully from those that didn’t. On loans that defaulted or had a charge off, there are some outliers that still had a positive return. Depending on how much of the loan was paid off and how many late fees were paid before defaulting, this is a possible scenario. A handful of completed loans have a loss, which is unexpected. It’s worth taking a closer look to see how that happened.

There are 16 loans with a completed status and zero return or less. Looking at the fields that drive actual return, seven of the loans were paid off on the same day they originated resulting in zero return. One was paid off a week after origination but didn’t pay any interest for that week. Eight of the loans show payments much less than the original amount with no charge off. The numbers on these eight look like entry errors and will be removed from the analysis since they will reduce the average return on completed loans.

## [1] 25981    86

Removing the questionable loans leaves us with 25,981 closed loans to examine.

Breaking-out the employment durations by employment status helps show how durations vary by status. On the boxplot, the categories of Employed, Full-Time and Self Employed look fairly similar with a median around 6 years. Part-time work tends to last a shorter duration. The category of Not Employed has the shortest duration. All categories have outliers with long durations. On the histogram, you can see that Employed peaks at zero and rapidly decreases, suggesting that folks tend to move around a lot. Full Time status actually peaks after zero with a steady decline over time. The category of Other has a spike at zero. It could be that a lot of folks are starting something new and quickly need money, or it could just be applicants evading the question.

Looking at occupation when employment status is Other might reveal an explanation.

## 
##                    Other                               Computer Programmer 
##                     2469                     1332                        2 
##           Accountant/CPA Administrative Assistant                    Judge 
##                        1                        1                        1

Almost all of the occupations when employment is Other are Other or blank, further suggesting that these are applicants who are evading the question.

Comparing actual returns by employment status shows that the median return is positive for all categories. The employment categories of Full-time, Part-time and Retired have positive means (the ‘x’ on the graph) and their IQRs are in positive territory. The employment categories of Not Employed, Other and Self-employed IQRs are wide and fall deep into negative territory suggesting investors are more likely to lose money on these loans.
Comparing the histograms of actual returns for Full-time and Other shows that Other has more losses relative to gains which pulls the mean away from the median.

## 
##  Pearson's product-moment correlation
## 
## data:  EmploymentStatusDuration and ActualReturn
## t = -0.20308, df = 25970, p-value = 0.8391
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01342182  0.01090181
## sample estimates:
##          cor 
## -0.001260192

Looking at the scatter plot of actual returns by employment duration, there is a pattern of decreasing gains and losses as employment duration increases. That is, the higher gains drop off leaving relatively more lower gains and the lesser losses drop off leaving the greater losses. The drop in gains makes me think the interest rates are lower due to better credit scores. The decrease in data points reflects the decrease in number of borrowers as employment duration increases that we saw earlier.

The red line shows the median actual return for the employment duration, and it shows a gain until the data points become too few for a steady line. The black line is the mean actual return. The blue line is a smoothed version of the mean, and that line varies little just inside the loss side of returns.

Ultimately, there is no correlation between actual return and the duration of employment.

The median of actual returns for almost all Occupations is positive while the means are mixed between gains and losses. The IQRs are all over the place. There are a number of occupations where the IQR is completely in positive territory. College Sophomores look like delinquents, though we can’t see how many loans are represented. It’s tempting to think that some borrower’s occupations on a loan might be a safer investments than others. Interestingly, the category of Other isn’t out of line compared to the other categories and actually has a higher range of returns. If borrowers were using Other to evade the question, I would expect it to show a lower return. There’s probably a better explanation for why this category is selected so often along with employment status of Other.

For borrowers that have an income, loan amounts increase as the incomes increase. Borrowers reporting no income or not employed are requesting loans around the size of the lowest two income ranges.

The median actual return in all income categories is positive, ranging from 0.069 for unemployed to 0.108 for $0 income. The mean returns, indicated by an ‘x’, start at a loss of 0.096 and move closer to the median as incomes increase. The IQR also narrows as incomes increase, most likely due to fewer defaults. It looks like borrowers with an income of at least $50,000 are most likely to have a positive return.

## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome..Annualized and ActualReturn
## t = 8.3947, df = 25979, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03987791 0.06413159
## sample estimates:
##        cor 
## 0.05201242

The scatter plot of actual returns by stated income shows two distinct groupings of loans with a gain and loans with a loss. The groupings look similar to the patterns in the actual returns by employment duration graph done earlier in the analysis. The correlation coefficient does not show a meaningful relationship. The blue line is the smoothed mean of actual returns, and it moves from a loss to a gain around $60,000 in income.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and ActualReturn
## t = -11.588, df = 25979, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08379380 -0.05959938
## sample estimates:
##         cor 
## -0.07170714

There is not a meaningful correlation between the amount borrowed and the investor’s actual return on the loan.

When looking at actual returns by the Prosper ratings, the median return doesn’t shift by a whole lot, but the IQRs significantly expand as risk goes up. The median returns increase as the loans become more risky for the first five categories while the mean returns decrease as risk increases. The final two categories with the most risk actually see a drop in median and mean returns.

A Prosper rating of D provides the highest median return, but the mean return is negative. Only ratings of AA and A have a positive median and mean return. It’s interesting to note that, as risk increases, the difference between the median and mean increases. The mean is lower than the median because of outliers caused by loan defaults. The difference becomes greater because of more defaults that pull the mean down. This makes me think that the Prosper rating is effective.

## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome..Annualized and LoanOriginalAmount
## t = 54.17, df = 84835, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1763331 0.1893414
## sample estimates:
##       cor 
## 0.1828452
## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome..Annualized and LoanOriginalAmount
## t = 132.6, df = 83056, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4123461 0.4235715
## sample estimates:
##       cor 
## 0.4179748

The scatter plot between income and loan amount appears to have a positive correlation, thought the correlation seems to break down with incomes over $100,000. Calculating the correlation coefficient for all incomes and then for incomes under $300,000 improves the coefficient from 0.183 to 0.418.

Looking at the scatterplot matrix, I see a couple more relationships to explore.

For instance, the correlation between term and loan amount suggests that borrowers want to spread larger loans over longer timeframes.

Also, the scatter plot for interest rate vs actual return shows how actual returns vary more and more as interest rates increase. Since higher interest rates are given to loans with higher risk, this pattern makes sense. Comparing the estimated return, which adjusts the interest rate for risk, to actual return changes that pattern to show the estimated return more closely grouped with actual return. This change in pattern shows that the estimated return is more realistic than looking at the interest rate when estimating returns.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The box plots of actual returns by loan status shows a clear separation of loans that were paid off from those that defaulted.

Comparing actual returns by employment status shows that the median return is positive for all categories, while mean and IQR are mixed between positive and negative returns.

The median of actual returns for almost all Occupations is positive while the means are mixed between gains and losses. It’s tempting to think that some borrower’s occupations on a loan application might be a safer investment than others.

The scatter plot of actual returns by stated income shows two distinct groupings of loans by gains and losses. A meaningful correlation was not found.

When looking at actual returns by the Prosper ratings, the median return doesn’t shift by a whole lot, but the inter-quartile-ranges significantly expand as risk goes up.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Breaking-out the employment durations by employment status helped show how durations vary by status. The categories of Employed, Full-Time and Self Employed look fairly similar with a median around 6 years. Part-time work tends to last a shorter duration. The category Not Employed is the shortest duration.

The scatter plot between income and loan amounts appears to have a positive correlation, thought the correlation seems to break down with incomes over $100,000.

What was the strongest relationship you found?

The Prosper rating has a strong relationship to actual returns. Also, loan amounts tend to increase as borrower income increases.

Multivariate Plots Section

This graph distinguish the Prosper rating on the actual return by employment duration scatterplot. The two best Prosper ratings dominate the lower gains along all employment durations. There are very few instances of these ratings having returns greater than 0.25, though there are a fair amount scattered in the loss side of returns. The riskier ratings are more prevalent in the higher gains and lower losses.

Though the number of loans decreases as employment duration increases, all rating categories are represented from 0 months on. The higher risk ratings do seem to drop off faster than the lower risk rating, indicating that employment duration is probably taken into account in the rating. An earlier chart showed that the smoothed mean of actual returns remained slightly negative across all employment durations. The reduction in loans with losses is balanced by the reduction in loans with higher gains.

This chart separates the smoothed mean of actual-return by employment status. Employment statuses of Full-time, Part-time and Retired show gains, while the other four categories show losses.

In this chart, you can see how loan amounts tend to increase as incomes increase. There doesn’t seem to be a pattern within actual returns.

For employees that were recently employed, you can see how income drives the ratings. This could be because these folks have limited credit history, so the rating is strongly based on income. As you go over 100 months of employments, the range of income becomes greater and the ratings become more varied as credit history becomes more robust.

The green line shows the mean actual return on loans with a Completed status. Since most are three year loans, the line starts declining at three years ago. The decline is a result of three year loans that completed early and had less interest paid on the loan than if they went full term. The blue and red lines are the Defaulted and Chargedoff statuses. The more payments that were made on these loans, the lower the loss. The more recent loans would have little paid on them and therefore a large loss, resulting in the narrowing range to larger losses.

Employment status is ordered by highest gains to highest losses as seen in an earlier boxplot. In the first chart, the blue points for Not Employed are lined-up at zero income. The darker brown for Part-time fall in the lower income levels an appear more in the gains than losses. The lighter brown for Full-time are scattered through all incomes and seem to be more prevalent in the higher actual returns, while the white points for Employed are more in the lower returns. The dark green for Other seems to be mostly in the lower incomes and loss side of returns. On this chart, you can see how the bulk of loans are for the Employed and Full-time employment statuses.

The second chart shows the smoothed mean of actual returns by employment status. The line for Not Employed is meaningless since income is always zero. For the most part, mean returns increase as income increases, thought self-employed is flat and part-time actually decreases.

The first chart shows a pattern of when actual returns are a gain. The higher ratings are mostly under a 0.25 actual return and extend out through the higher incomes. As the actual return increases, you can see how the ratings drop since loans with higher risk require a higher interest rate. On the loss side of the chart, defaults in the lower incomes have the lower rating, but higher ratings become more prevalent as incomes increase. In general, defaults become more sparse as incomes rise. As a whole, as incomes increase over $100,000, the number of loans seems to drop off for the lower Prosper ratings. Higher rated loans appear to drop off after $150,000.

There is a nice grouping of high risk loans with returns over 0.5 when incomes are under $100,000. Though this seems like a great opportunity for really good returns, the previous chart for return vs income and returns vs ratings shows that the losses pull average returns into a loss. The median return is a gain, but over time you will lose money on high risk loans.

The second chart adds smoothed means by Prosper rating to the previous scatterplot using a different color scheme. The best two ratings start with positive returns and increase slowly as incomes increase. The remaining five ratings start with a loss and are all showing gains before income reaches $100,000. It looks like favorable returns are likely on risky loans when the borrower’s income is over $80,000. From earlier charts, high risk loans with high incomes occur less frequently.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

For actual returns, the two best Prosper ratings dominate the lower gains along all employment durations. The riskier ratings are more prevalent in the higher gains and are spread out across losses. As duration increases, the higher risk ratings seem to drop off faster than the lower risk rating.

Displaying the smoothed means of actual returns by employment status over employment durations shows that employment statuses of Full-time, Part-time and Retired are most likely to return gains.

As borrower income increases, actual returns by employment status tend to increase. Full-time and Retired employment start with gains and increase as duration increases. Employed and Other start with loses and increase to gains around $120,000 income. Self-employed stays on the loss side with little change. Part-time workers start with a nice gain and show decreasing gains as income increases.

When looking at the smoothed means of actual returns by income and Prosper rating, returns increase as income increases for all ratings. Though the top two rating show gains across all incomes, all ratings show a gain on returns as incomes increase over $80,000.

Were there any interesting or surprising interactions between features?

It’s interesting to see that the Prosper rating is a good indication of risk on a loan. You can also see how greater risk can result in higher returns, but that same risk can also result in greater losses that ultimately hurt overall returns.


Final Plots and Summary

Plot One

Description One

This boxplot shows the median, mean and IQR of actual returns by income range, reflecting how the borrower’s income affects the potential return on loans. Income above $50,000 is where the median and mean are a gain.

Plot Two

Description Two

This boxplot shows the median, mean and IQR of actual returns by Prosper rating, reflecting how the rating is a good predictor of the potential return on loans. Only ratings of AA and A have a median and mean gain.

Plot Three

Description Three

This scatterplot with smoothed means of actual returns brings the previous two plots together for a nice picture of what income and Prosper rating combinations will likely, on average, have a positive actual return. The smoothed mean for AA and A rated loans show a gain for all incomes. The smoothed mean for the remaining ratings moves from a loss to a gain in the $60,000 to $100,000 range.


Reflection

When I started exploring this dataset, I went to Prosper’s website to understand what Prosper does. The prospectus on the website helped to put the dataset into context and to better understand the variables.

As I started exploring the variables, I quickly realized that the actual return was not provided on loans that were closed. Deciding how to best calculate the actual return took some time, and I found it easiest to write some records of closed loans to an Excel file so I could more easily try different formulas. I do a lot of work in Excel for my current job, and ‘seeing’ the data through command lines has been a big shift for me. By the end of this project, I found myself quickly writing commands to see the data in R. Being immersed in RStudio to complete this project has helped me develop a feel for the environment.

I was surprised to find inconsistencies in the dataset. For instance, a Completed loan should have principal paid equal to the original loan amount, but there were some cases where these numbers weren’t even close. Sometimes the Customer Payments field seemed to include the full amount, but sometimes the missing money wasn’t accounted for in any of the other fields. Overall, I believe the dataset is accurate, but finding such inconsistencies makes me wonder what other issues I didn’t find that skew results.

The boxplot of actual returns by income range seems to show a clear pattern of improved returns as incomes increased, yet calculating the correlation to stated-income didn’t show a meaningful correlation. Reshaping the data in some way would probably reveal the answer, and further separating loans by Prosper rating might reveal a correlation that could then be used for a linear model. There are still many combinations of variables to explore.

It would be interesting to model a random portfolio of loans to see how the investment would turn out. The dataset doesn’t provide the actual payment and fees dates, but there should be enough information to determine an overall return. Based on what I found, I would invest $1,000 by spreading $100 across 10 loans that meet the following criteria: Any AA or A rated loan, or B rated loans with borrower’s income at least $60,000, or any subsequent Prosper rating adding $10,000 to income for each step. It would be good to include a cap on loan size based on income as well.

One final thought on the results. According to an article I found at HuffingtonPost.com, the National Bureau of Economic Research states that the Great Recession officially began in December 2007 and ended in June 2009. The dataset in this analysis starts in August 2009 and ends in March 2014, which has been a time of recovery. How would these results change in times of expansion or another downturn? Prosper is new and doesn’t have history in those environments. If investing a significant amount of money, it would be worth seeing how other lending platforms perform in different economic conditions.